Predict Bike Rentals

Posted on Sun 23 September 2018 in Machine Learning

Predicting Bike Rentals in Washington, D.C.

The data set contains 17380 rows, each describing the bike rentals during a single hour.

The goal is to predict the total number of bikes rented in a given hour (the "cnt" column).

In [10]:

import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

In [11]:

bike_rentals = pd.read_csv('bike_rental_hour.csv')
print(bike_rentals.head())

   instant      dteday  season  yr  mnth  hr  holiday  weekday  workingday  \
0        1  2011-01-01       1   0     1   0        0        6           0   
1        2  2011-01-01       1   0     1   1        0        6           0   
2        3  2011-01-01       1   0     1   2        0        6           0   
3        4  2011-01-01       1   0     1   3        0        6           0   
4        5  2011-01-01       1   0     1   4        0        6           0   

   weathersit  temp   atemp   hum  windspeed  casual  registered  cnt  
0           1  0.24  0.2879  0.81        0.0       3          13   16  
1           1  0.22  0.2727  0.80        0.0       8          32   40  
2           1  0.22  0.2727  0.80        0.0       5          27   32  
3           1  0.24  0.2879  0.75        0.0       3          10   13  
4           1  0.24  0.2879  0.75        0.0       0           1    1

In [3]:

bike_rentals["cnt"].hist(bins=50)
plt.show()
bike_rentals["cnt"].hist(bins=50, range=[0, 100])
plt.show()

In [4]:

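# numeric_only=True skips the non-numeric "dteday" column (required on recent pandas versions)
print(bike_rentals.corr(numeric_only=True)["cnt"])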

instant       0.278379
season        0.178056
yr            0.250495
mnth          0.120638
hr            0.394071
holiday      -0.030927
weekday       0.026900
workingday    0.030284
weathersit   -0.142426
temp          0.404772
atemp         0.400929
hum          -0.322911
windspeed     0.093234
casual        0.694564
registered    0.972151
cnt           1.000000
Name: cnt, dtype: float64

Calculating Features

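New features derived from existing columns can give a model information it would otherwise miss and so improve its accuracy. Here, the hour of the day is grouped into a coarser time-of-day label.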

In [12]:

def assign_label(row_hour):
    # Bucket the hour of day into four periods:
    # 1 = morning (6-12), 2 = afternoon (12-18), 3 = evening (18-24), 4 = night (0-6)
    if 0 <= row_hour < 6:
        return 4
    elif 6 <= row_hour < 12:
        return 1
    elif 12 <= row_hour < 18:
        return 2
    elif 18 <= row_hour <= 24:
        return 3
    
bike_rentals["time_label"] = bike_rentals["hr"].apply(assign_label)
bike_rentals["time_label"].head(20)

Out[12]:

0     4
1     4
2     4
3     4
4     4
5     4
6     1
7     1
8     1
9     1
10    1
11    1
12    2
13    2
14    2
15    2
16    2
17    2
18    3
19    3
Name: time_label, dtype: int64
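
The same bucketing can also be done without a row-by-row apply. The following is a minimal sketch using pd.cut, assuming the hours are integers from 0 to 23. The sample hours series is made up for illustration and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample of hour values (the real data would use bike_rentals["hr"]).
hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 23])

# Left-closed bins [0, 6), [6, 12), [12, 18), [18, 24) mapped to the same
# labels as assign_label: 4 = night, 1 = morning, 2 = afternoon, 3 = evening.
time_label = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=[4, 1, 2, 3],
).astype(int)

print(time_label.tolist())
```

One difference from assign_label: an hour value of exactly 24 falls outside the last bin here, so this version only suits data where "hr" never exceeds 23.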

Train / Test Split

In [13]:

train = bike_rentals.sample(frac = .8)
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]
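
The split above draws a different random sample on every run, so the error values reported later will vary slightly between runs. As a hedged sketch, assuming reproducibility is wanted, passing random_state to sample fixes the draw. The small df below is a hypothetical stand-in for bike_rentals.

```python
import pandas as pd

# Hypothetical stand-in for bike_rentals, just to show the mechanics.
df = pd.DataFrame({"hr": range(100), "cnt": range(100, 200)})

# random_state fixes the shuffle so the same rows are drawn on every run.
train_demo = df.sample(frac=0.8, random_state=1)
# Every row not sampled into the training set becomes the test set.
test_demo = df.drop(train_demo.index)

print(len(train_demo), len(test_demo))
```

Using drop(train_demo.index) is equivalent to the isin-based filter in the cell above, as long as the index values are unique.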

Remove bad features

In [14]:

list_features = list(train.columns)
bad_features = ["cnt","casual","registered","dteday"]

for el in bad_features:
    list_features.remove(el)
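
The notebook does not state why these columns are dropped. A likely reason is that "cnt" is the target itself, "casual" and "registered" add up to "cnt" and would leak the answer, and "dteday" is a date string that the models cannot consume directly. The sketch below checks the leakage assumption on the five rows shown in the head() output. The extra column names in all_columns are only there to illustrate the filtering.

```python
import pandas as pd

# First five rows of the relevant columns, copied from the head() output above.
sample = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "cnt": [16, 40, 32, 13, 1],
})

# If casual + registered always equals cnt, keeping them as features
# would hand the model the answer.
leaks_target = (sample["casual"] + sample["registered"] == sample["cnt"]).all()
print(leaks_target)

# Equivalent way to build the feature list without mutating it in a loop.
bad_features = ["cnt", "casual", "registered", "dteday"]
all_columns = list(sample.columns) + ["dteday", "hr", "temp"]
list_features_demo = [c for c in all_columns if c not in bad_features]
print(list_features_demo)
```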

Linear Regression for predicting bike rentals

In [15]:

from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(train[list_features], train["cnt"])
predictions = model.predict(test[list_features])

print(np.mean((predictions - test["cnt"]) ** 2))

17462.2988855

The error is very high. One likely reason is that the data contains a few extremely high rental counts among mostly low ones. These high counts behave like outliers because there are so few of them, and since MSE squares each error, large errors are penalized much more heavily, which inflates the total.
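
To illustrate how a single large miss dominates MSE, here is a small sketch comparing MSE with mean absolute error (MAE). The counts and predictions are hypothetical numbers chosen for the example, not values from the data set.

```python
import numpy as np

# Hypothetical hourly counts: mostly low, with one very busy hour.
y_true = np.array([10, 20, 30, 40, 900])
# Hypothetical predictions that miss the busy hour badly.
y_pred = np.array([20, 30, 40, 50, 500])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))   # every error counts in proportion to its size
mse = np.mean(errors ** 2)      # squaring magnifies the large error

# Fraction of the total squared error caused by the single worst prediction.
share_from_largest = np.max(errors ** 2) / np.sum(errors ** 2)
print(mae, mse, round(share_from_largest, 4))
```

In this toy case the one badly predicted hour accounts for nearly all of the squared error, which is the effect described above.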

Decision Trees

In [17]:

from sklearn.tree import DecisionTreeRegressor

model_tree = DecisionTreeRegressor()

model_tree.fit(train[list_features], train["cnt"])
predictions = model_tree.predict(test[list_features])


print(np.mean((predictions - test["cnt"]) ** 2))

3388.80322209

In [23]:

model_tree = DecisionTreeRegressor(min_samples_leaf = 5)

model_tree.fit(train[list_features], train["cnt"])
predictions = model_tree.predict(test[list_features])


print(np.mean((predictions - test["cnt"]) ** 2))

2663.96872139

A non-linear predictor does much better: the decision tree's MSE (about 3389, or about 2664 with min_samples_leaf = 5) is far lower than the 17462 obtained with linear regression.
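
The effect of min_samples_leaf can be explored by trying several values and comparing the test error. The sketch below does this on synthetic data, since the original CSV is not bundled here. The data-generating function and the candidate values are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy, non-linear relation between an "hour" feature and a count.
rng = np.random.default_rng(0)
X = rng.uniform(0, 24, size=(600, 1))
y = 100 * np.sin(X[:, 0] / 24 * 2 * np.pi) ** 2 + rng.normal(0, 20, size=600)

# Rows were generated independently, so a simple positional split is fine here.
X_train, X_test = X[:480], X[480:]
y_train, y_test = y[:480], y[480:]

mse_by_leaf = {}
for leaf in (1, 5, 20, 100):
    tree = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=1)
    tree.fit(X_train, y_train)
    pred = tree.predict(X_test)
    mse_by_leaf[leaf] = float(np.mean((pred - y_test) ** 2))

for leaf, mse in mse_by_leaf.items():
    print(leaf, round(mse, 1))
```

Very small leaves tend to fit noise and very large leaves tend to underfit, so the best value usually lies somewhere in between and depends on the data.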

Random Forests: Improve the Decision Tree prediction

In [37]:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=1, min_samples_leaf=2)

rf.fit(train[list_features], train["cnt"])
predictions = rf.predict(test[list_features])

print(np.mean((predictions - test["cnt"]) ** 2))

1875.74067529

The random forest has a lower error than a single decision tree (MSE of about 1876 versus 2664) because averaging the predictions of many trees reduces overfitting.
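
MSE is expressed in squared bikes, which is hard to read. As a small sketch, taking the square root (RMSE) puts the errors reported above back on the scale of "cnt". The dictionary keys are just descriptive labels chosen for this example.

```python
import math

# MSE values copied from the outputs printed above.
mse_by_model = {
    "linear regression": 17462.2988855,
    "decision tree": 3388.80322209,
    "decision tree (min_samples_leaf=5)": 2663.96872139,
    "random forest": 1875.74067529,
}

# RMSE is the square root of MSE, in the same unit as the target (bikes per hour).
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}
for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:.1f} bikes/hour")
```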